This report analyzes the dataset visually and statistically, using a correlation heatmap, summary statistics, histograms, scatter plots, and linear regression.
The heatmap shows the pairwise correlation between the numerical columns in the dataset. A value close to 1 indicates a strong positive correlation, a value close to -1 indicates a strong negative correlation, and a value close to 0 indicates little or no linear relationship.
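As a minimal sketch of how the matrix behind such a heatmap can be computed, the example below builds a small synthetic table and calls pandas `DataFrame.corr`. The values are randomly generated for illustration only (the column names are borrowed from this report, the numbers are not), and the use of the Pearson coefficient is an assumption, since the report does not state which correlation measure was used.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption, not the report's dataset).
rng = np.random.default_rng(0)
n = 500
size = rng.normal(size=n)
df = pd.DataFrame({
    "Size": size,
    # Constructed to move with Size: expect a correlation near +1.
    "Weight": 0.8 * size + rng.normal(scale=0.5, size=n),
    # Constructed to move against Size: expect a clearly negative value.
    "Sweetness": -0.6 * size + rng.normal(scale=0.8, size=n),
    # Generated independently of Size: expect a value near 0.
    "Ripeness": rng.normal(size=n),
})

# Pairwise Pearson correlation between all numerical columns.
corr = df.corr(method="pearson")
print(corr.round(2))
```

The resulting square matrix is symmetric with ones on the diagonal, and it is the kind of input a plotting routine such as seaborn's `heatmap` would color-code.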
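The table below summarizes the basic statistics for each numerical column: the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum. A_id appears to be a row identifier running from 0 to 3999, so its statistics carry no analytical meaning.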
| Column | Count | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|---|
| A_id | 4000 | 1999.5 | 1154.8449 | 0 | 999.75 | 1999.5 | 2999.25 | 3999 |
| Size | 4000 | -0.5030 | 1.9281 | -7.1517 | -1.8168 | -0.5137 | 0.8055 | 6.4064 |
| Weight | 4000 | -0.9895 | 1.6025 | -7.1498 | -2.0118 | -0.9847 | 0.0310 | 5.7907 |
| Sweetness | 4000 | -0.4705 | 1.9434 | -6.8945 | -1.7384 | -0.5048 | 0.8019 | 6.3749 |
| Crunchiness | 4000 | 0.9855 | 1.4028 | -6.0551 | 0.0628 | 0.9982 | 1.8942 | 7.6199 |
| Juiciness | 4000 | 0.5121 | 1.9303 | -5.9619 | -0.8013 | 0.5342 | 1.8360 | 7.3644 |
| Ripeness | 4000 | 0.4983 | 1.8744 | -5.8646 | -0.7717 | 0.5034 | 1.7662 | 7.2378 |
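The column set of this table (count, mean, std, min, 25%, 50%, 75%, max) matches the default output of pandas `DataFrame.describe`, so it is plausible, though not confirmed by the report, that it was produced that way. The sketch below applies `describe` to a hypothetical stand-in column holding the integers 0 to 3999, which is what the A_id row suggests that column contains.

```python
import pandas as pd

# Hypothetical stand-in for the real data: a plain 0..3999 row index.
df = pd.DataFrame({"A_id": range(4000)})

# describe() reports count, mean, std, min, quartiles, and max for each
# numeric column; transposing puts one dataset column on each row.
summary = df.describe().T
print(summary)
```

For this stand-in the mean is 1999.5 and the sample standard deviation is about 1154.84, the same figures as the A_id row, which supports reading A_id as a sequential identifier rather than a measured feature.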
Histograms show how the values of each numerical column are distributed. The overlaid density curve gives a smoothed estimate of the underlying probability distribution.
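The following sketch shows one way the two ingredients of such a plot can be computed: a density-normalized histogram and a Gaussian kernel density estimate. The data are synthetic, and both the bin count and the bandwidth rule (a normal-reference rule of thumb) are assumptions, since the report does not say how its density curves were obtained.

```python
import numpy as np

# Synthetic stand-in values (an assumption, not the report's data).
rng = np.random.default_rng(0)
values = rng.normal(loc=0.5, scale=1.9, size=4000)

# Histogram scaled so that the bar areas sum to 1, which makes it
# directly comparable to a probability density curve.
heights, edges = np.histogram(values, bins=30, density=True)
bar_area = np.sum(heights * np.diff(edges))

# Gaussian kernel density estimate evaluated on a regular grid.
# Bandwidth: normal-reference rule of thumb, 1.06 * sigma * n^(-1/5).
n = len(values)
bandwidth = 1.06 * values.std(ddof=1) * n ** (-1 / 5)
grid = np.linspace(values.min(), values.max(), 200)
z = (grid[:, None] - values[None, :]) / bandwidth
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

print(f"histogram area: {bar_area:.3f}, bandwidth: {bandwidth:.3f}")
```

Plotting `heights` as bars over `edges` and `density` as a line over `grid` would give a figure of the kind described above. A smaller bandwidth makes the curve follow the bars more closely, while a larger one smooths it further.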
Scatter plots show the relationship between two numerical variables. Patterns in the point cloud can reveal correlations, trends, or outliers.
Linear regression models the relationship between two numerical variables by fitting a straight line through the data points. The red line in each graph represents the linear fit.
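To make the fitting step concrete, here is a minimal sketch that fits a straight line by ordinary least squares with `numpy.polyfit` and reports how much of the variance the line explains. The data are synthetic, and ordinary least squares is an assumption, because the report does not name the fitting method behind its red lines.

```python
import numpy as np

# Synthetic stand-in data with a known linear trend plus noise
# (an assumption, not the report's data).
rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.8 * x - 0.5 + rng.normal(scale=0.5, size=1000)

# Degree-1 polynomial fit: returns the slope first, then the intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Coefficient of determination: share of the variance in y that the
# fitted line accounts for.
predicted = slope * x + intercept
ss_res = np.sum((y - predicted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"y = {slope:.3f} * x + {intercept:.3f}, R^2 = {r_squared:.3f}")
```

For a single predictor, this R² equals the square of the Pearson correlation between the two variables, which ties the fitted line back to the corresponding cell of the correlation heatmap.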